Exploring White Wines Dataset by Valmik

Univariate Plots Section

First I will take a look at dimensions, column names, structure and summary of the dataset.

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
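
The four calls behind the output above can be sketched as follows. This is only an illustration: `wines.head` is a hypothetical name, and the tiny data frame is built from the first five rows of four columns as printed by `str()` so the snippet runs without the CSV file. On the full data the same calls would be applied to the data frame returned by `read.csv()`.

```r
# First five rows of four columns, copied from the str() output above.
# `wines.head` is an illustrative name, not one used in the original analysis.
wines.head <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = rep(6L, 5)
)

dim(wines.head)      # number of rows and columns
names(wines.head)    # column names
str(wines.head)      # type and first values of each column
summary(wines.head)  # min, quartiles, mean and max per column
```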

The quality is an integer value with median 6 and mean 5.878. Now I will plot the histogram for quality to ascertain the type of distribution.

We can see that most quality values lie between 5 and 7. The minimum and maximum values of quality are 3 and 9 respectively. I believe that quality should be an ordered factor since the values are discrete and ranked from low to high. I will do the conversion to an ordered factor now.

##  Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
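
A minimal sketch of such a conversion, assuming the levels are fixed to the observed range 3 to 9. The explicit `levels` argument is my assumption, needed here because the ten sample values are all 6. The integer code 4 in the output above is simply the position of "6" among the seven levels.

```r
# First ten quality values, as shown by str() earlier.
quality <- c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6)

# Ordered factor with one level per possible score from 3 to 9.
quality.ordered <- factor(quality, levels = 3:9, ordered = TRUE)
str(quality.ordered)

# Keep a numeric copy for calculations such as correlations.
quality.num <- as.integer(as.character(quality.ordered))
```

Going through `as.character()` before `as.integer()` recovers the original scores rather than the internal level codes.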

Now I will look at histograms of all the other variables starting with fixed acidity to determine their distributions.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adjusting the binwidth and removing outliers from the above graph.

Fixed acidity follows an approximately normal distribution with mean 6.855 and median 6.8.

From now on I will preadjust the binwidth and remove outliers from histograms.
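
One way to set the binwidth and keep extreme values from stretching the axis is sketched below. The helper name `plot_trimmed_hist` and the 99th percentile cut-off are assumptions for illustration, not the settings used for the plots in this report. The sketch zooms the axis with `coord_cartesian()` instead of dropping rows, so the bin counts are unchanged.

```r
library(ggplot2)

# Hypothetical helper: histogram with an explicit binwidth, zoomed to the
# range below a chosen upper quantile.
plot_trimmed_hist <- function(df, column, binwidth, upper_quantile = 0.99) {
  limit <- quantile(df[[column]], upper_quantile, names = FALSE)
  ggplot(df, aes(x = .data[[column]])) +
    geom_histogram(binwidth = binwidth) +
    coord_cartesian(xlim = c(min(df[[column]]), limit)) +
    labs(x = column, y = "count")
}

# First ten fixed acidity values from the str() output.
acidity <- data.frame(
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)
)
p <- plot_trimmed_hist(acidity, "fixed.acidity", binwidth = 0.1)
```

Supplying `binwidth` explicitly also silences the `stat_bin()` message shown earlier.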

Moving on to volatile acidity.

Even this distribution is approximately normal. The mean volatile acidity is 0.2782 while the median is 0.26.

Now I will plot the distribution for citric acid.

Citric acid too follows a normal distribution with mean 0.3342 and median 0.32.

Moving on to residual sugar.

This distribution is highly right skewed, which is consistent with the large gap between the mean (6.391) and the median (5.2). I will now apply a log transformation to see whether the transformed values look closer to a normal distribution.

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

It can be observed that residual sugar has a bimodal distribution after log transformation.
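
A log-scaled histogram of this kind can be sketched as below, using the first ten residual sugar values from the `str()` output. The binwidth of 0.1 is an assumption. Note that once the x scale is `log10`, the binwidth is measured in log10 units. The "Scale for 'x' is already present" message above is what ggplot2 prints when a second x scale is added to a plot that already has one.

```r
library(ggplot2)

# First ten residual sugar values from the str() output.
sugar <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Histogram on a log10 x axis; binwidth is in log10 units here.
p.log <- ggplot(sugar, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 0.1) +
  scale_x_log10() +
  labs(x = "residual sugar (log10 scale)", y = "count")
```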

Now I will plot a histogram of chlorides.

This distribution is also highly right skewed. The mean and median values of chlorides are 0.04577 and 0.043 respectively. I will now apply a log transformation to the x axis to try to bring it closer to a normal distribution.

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

After log scaling, chlorides has an approximately normal distribution.

Moving on to free sulfur dioxide.

Free sulfur dioxide has a normal distribution with median 34.00 and mean 35.31.

Moving on to total sulfur dioxide.

Total sulfur dioxide also has a normal distribution. The median and mean values are 134 and 138.4 respectively.

Plotting the histogram for density now.

Density also follows an approximately normal distribution with mean 0.994 and median 0.9937.

Moving on to pH now.

pH has an almost perfectly normal distribution, which is supported by the fact that the mean (3.188) and median (3.180) are nearly equal.

Plotting the histogram for sulphates now.

Sulphates also follows an approximate normal distribution. The mean and median values are 0.4898 and 0.47 respectively.

Moving on to the last variable alcohol.

Alcohol does not follow a perfect normal distribution but we can approximate it as such. This can be validated by the fact that there is little difference between mean (10.51) and median (10.4).
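
The mean versus median comparison used throughout this section can be put on a common scale by dividing the gap by the median. The sketch below does this for four variables using the values from the summary output. The `relative.gap` column is my own illustrative measure, and it is only a rough symmetry check, not a test of normality.

```r
# Means and medians copied from the summary() output.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "pH", "alcohol"),
  mean     = c(6.391, 0.04577, 3.188, 10.51),
  median   = c(5.2, 0.043, 3.18, 10.4)
)

# Gap between mean and median as a fraction of the median.
stats$relative.gap <- (stats$mean - stats$median) / stats$median

# Largest gaps first.
stats[order(-stats$relative.gap), ]
```

The gap is roughly 23% of the median for residual sugar and well under 1% for pH, which matches the shapes of the two histograms.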

I would also like to create a new variable for bound sulfur dioxide since we have variables for total and free sulfur dioxide. This new variable could be useful in future analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

Bound sulfur dioxide also follows an approximately normal distribution with mean 103.1 and median 100.
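
The new variable is consistent with total minus free sulfur dioxide, as the following sketch shows on the first ten rows printed by `str()`. The vector names are illustrative.

```r
# First ten values of each column, copied from the str() output.
free  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Bound sulfur dioxide is the part of the total that is not free.
bound <- total - free
bound
```

These ten values agree with the `bound.sulfur.dioxide` column that appears in the `str()` output later in the report.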

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wine observations with 13 columns: a row index X and 12 variables for each wine. One of the variables, quality, can be considered an ordered factor since it only has discrete integer values ranging from 3 to 9. All other variables are continuous numeric features.

What is/are the main feature(s) of interest in your dataset?

We want to determine a model for predicting quality so quality is of course the most important feature. Other than that I believe that alcohol level will play a significant part in determining the quality of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

According to research, residual sugar and sulphates play a big role in determining the quality of the wine. I expect these features to support my investigation into the feature of interest, which is quality.

Did you create any new variables from existing variables in the dataset?

I created bound sulfur dioxide from two existing variables, free sulfur dioxide and total sulfur dioxide, since it could help me understand the dataset further and play a big part in future analysis. I also changed quality to an ordered factor.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The residual sugar and chlorides were the only unusual distributions that didn’t look normal. I transformed these variables to a logarithmic scale since both were highly right skewed. The residual sugar converted to a clear bimodal distribution while chlorides had a normal distribution after transformation.

Bivariate Plots Section

First I will create a scatterplot matrix using ggpairs to explore all the data in one chart.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The variables with the strongest correlation with quality are alcohol, density and chlorides.

Now we will further explore the relationship between these three variables and quality through box plots and scatter plots.

To make the plots clearer I will add jitter, remove outliers and add transparency. The resultant plots are below.

The quality boxplot shows that when quality increases from 5 to 9, alcohol level also rises slightly with it. This is consistent with the positive correlation between the two features. A slight upward trend in the dense part can also be observed in the scatter plot. Chlorides and density have a looser negative correlation with quality than alcohol does. Still, a slight decreasing trend can be observed in the above plots.

Now we will move on to examining relationships between features other than quality that have a strong correlation, i.e. an absolute value higher than 0.5.

The correlation between residual sugar and density is 0.839. Sugar is denser than the other ingredients in the wine, so higher sugar levels lead to higher density, which is apparent from the above plot. Alcohol is less dense than water, so the correlation of -0.78 between alcohol and density and the decreasing trend in the scatter plot make sense. Bound sulfur dioxide has a positive correlation of 0.504 with density. This could be because bound sulfur dioxide also has a negative correlation of -0.449 with alcohol.

Other than this there is high correlation between total sulfur dioxide & free sulfur dioxide (0.616) and total sulfur dioxide & bound sulfur dioxide (0.922). This is expected since the variables are dependent on each other.
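
Screening for pairs above the 0.5 threshold can be automated. The helper below is a sketch, and `strong_pairs` is a hypothetical name. It computes the Pearson correlation matrix with `cor()` and keeps each pair once by looking only at the upper triangle. It is demonstrated on the first ten rows of the three sulfur dioxide columns, so the coefficients it returns will differ from the full-data values quoted above.

```r
# Hypothetical helper: list variable pairs whose absolute Pearson
# correlation exceeds a threshold. Expects a data frame of numeric columns.
strong_pairs <- function(df, threshold = 0.5) {
  r <- cor(df)
  idx <- which(upper.tri(r) & abs(r) > threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

# First ten rows of the sulfur dioxide columns from the str() output.
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)
so2$bound.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

pairs.found <- strong_pairs(so2)
pairs.found
```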

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I evaluated various features against the feature of interest in the dataset. The feature of interest, quality, had a relatively strong correlation with alcohol, density and chlorides. Alcohol has a positive correlation with quality, while chlorides and density have a negative one. However, none of these relationships is exactly linear, as can be observed from the box plots.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Density had interesting relationships with multiple variables. Density increased with increasing sugar and bound sulfur dioxide levels and decreased with increasing alcohol levels. This can be explained by the higher density of sugar and the lower density of alcohol compared to other ingredients.

Other than that bound sulfur dioxide has a negative correlation with alcohol and a positive correlation with density. I believe that during the fermentation process when more and more sugar is converted to alcohol, the levels of bound sulfur dioxide also decrease along with sugar levels.

What was the strongest relationship you found?

The strongest relationship I found was between bound sulfur dioxide and total sulfur dioxide. This is expected because bound sulfur dioxide is derived from total sulfur dioxide, so the two variables are not independent.

Multivariate Plots Section

For the multivariate analysis I will convert alcohol into an ordered factor by dividing it into buckets. Then I will plot line graphs to determine the relation of quality with median density and median chlorides for the different alcohol levels. This will give us further insights into our feature of interest.

## 'data.frame':    4898 obs. of  16 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ quality.num         : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ bound.sulfur.dioxide: num  125 118 67 139 139 67 106 125 118 101 ...
##  $ alcohol.bucket      : Ord.factor w/ 4 levels "(7,9.5]"<"(9.5,10.4]"<..: 1 1 2 2 2 2 2 1 1 3 ...
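
The bucketing can be sketched with `cut()`. Only the first two level labels, "(7,9.5]" and "(9.5,10.4]", are visible in the output above. The remaining break points (11.4 and 15) are my assumptions, chosen from the third quartile and a value above the maximum of 14.2.

```r
# First ten alcohol values from the str() output.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Right-closed intervals; the breaks 11.4 and 15 are assumed, see above.
alcohol.bucket <- cut(alcohol,
                      breaks = c(7, 9.5, 10.4, 11.4, 15),
                      ordered_result = TRUE)

levels(alcohol.bucket)
as.integer(alcohol.bucket)
```

With these breaks the integer codes for the ten rows are 1 1 2 2 2 2 2 1 1 3, the same codes shown for `alcohol.bucket` in the `str()` output.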

The first chart clearly shows that the median density decreases as the alcohol level increases. For lower alcohol levels, density decreases with increasing quality, and the lines for these buckets do not overlap and follow similar slopes. The trend is more erratic at higher alcohol levels. The second chart is a bit more complicated. In general, the median level of chlorides is higher when the alcohol level is lower. However, this is not the case for the lowest quality level of 3. This might be due to noise in the data since there are only 20 observations of wine with quality 3.
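
The medians behind such line charts can be computed with `aggregate()`. The sketch below uses only the first five rows, with density as rounded by `str()`, and the same assumed alcohol breaks of 7, 9.5, 10.4, 11.4 and 15. It shows the shape of the calculation, not the actual results.

```r
# First five rows from the str() output; density is rounded as printed.
wines.head <- data.frame(
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = factor(rep(6, 5), levels = 3:9, ordered = TRUE)
)

# Breaks 11.4 and 15 are assumptions; only the first two intervals are
# visible in the report.
wines.head$alcohol.bucket <- cut(wines.head$alcohol,
                                 breaks = c(7, 9.5, 10.4, 11.4, 15),
                                 ordered_result = TRUE)

# One median density per (quality, alcohol bucket) combination present.
median.density <- aggregate(density ~ quality + alcohol.bucket,
                            data = wines.head, FUN = median)
median.density
```

Plotting `density` against `quality` with one line per `alcohol.bucket` then gives a chart of the kind described above.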

Now I would like to explore one more relationship before concluding the analysis. I will plot a scatterplot of bound sulfur dioxide against alcohol for different quality levels. For this I have divided quality into two buckets, (2,5] and (5,9].

## 'data.frame':    4898 obs. of  17 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ quality.num         : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ bound.sulfur.dioxide: num  125 118 67 139 139 67 106 125 118 101 ...
##  $ alcohol.bucket      : Ord.factor w/ 4 levels "(7,9.5]"<"(9.5,10.4]"<..: 1 1 2 2 2 2 2 1 1 3 ...
##  $ quality.bucket      : Ord.factor w/ 2 levels "(2,5]"<"(5,9]": 2 2 2 2 2 2 2 2 2 2 ...
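
The two quality buckets follow directly from `cut()` with breaks at 2, 5 and 9. The sketch below applies it to the seven possible quality scores to show which bucket each one falls into.

```r
# The seven quality scores that occur in the dataset.
quality.num <- 3:9

# Two right-closed buckets: (2,5] for lower and (5,9] for higher quality.
quality.bucket <- cut(quality.num, breaks = c(2, 5, 9), ordered_result = TRUE)

data.frame(quality.num, quality.bucket)
```

Scores 3 to 5 land in the lower bucket and 6 to 9 in the higher one, which is why the rows with quality 6 shown above have bucket code 2.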

We can clearly see that there is a negative correlation between alcohol and bound sulfur dioxide for lower quality wines. Lower quality wines also generally have higher bound sulfur dioxide content and lower alcohol levels. On the other hand, no strong negative correlation is observed between bound sulfur dioxide and alcohol for higher quality wines. For these wines the bound sulfur dioxide content is in the same range at all alcohol levels.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I was able to explore the relationship of the feature of interest with other variables in detail. Visualizing the relationships between density, chlorides, alcohol and quality concisely allowed me to evaluate them at a deeper level. I determined that the relationship between density and alcohol stays consistent for all quality values. However, the relationship between chlorides and alcohol may change based on the quality.

Were there any interesting or surprising interactions between features?

The interaction between bound sulfur dioxide and alcohol was the most interesting. For lower quality wines there was a negative correlation between bound sulfur dioxide and alcohol levels, but there was no similar trend in higher quality wines. In fact, for higher quality wines the bound sulfur dioxide content was more or less in the same range.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not create any model with the dataset. The only relationships in the dataset that could be modeled are those of density with alcohol and of density with residual sugar, since only these pairs had strong enough correlations. These relationships are of no interest since they can be explained by simple science and don’t contain the feature of interest, quality.


Final Plots and Summary

Plot One

Description One

This is the most informative plot in the analysis, clearly showing the relationship between alcohol content and wine quality. The boxplots show the alcohol content dropping over wines of quality 3, 4 and 5 before rising steeply again in wines of quality 6, 7 and 8. I have further improved the plot by adding color and proper labels to it.

Plot Two

Description Two

This plot shows another important relationship involving our feature of interest, quality. The chart shows that the median density decreases as the alcohol level increases. For lower alcohol levels, density also decreases with increasing quality, and this trend is similar across the lower alcohol buckets. The trend is more erratic at higher alcohol levels.

Plot Three

Description Three

This chart is the most interesting and surprising to me. It shows the relationship between bound sulfur dioxide and alcohol at different quality levels. Overall there is a fairly strong negative correlation between bound sulfur dioxide and alcohol. We can see from the chart that for lower quality levels there is a strong negative correlation between bound sulfur dioxide and alcohol. But what is most surprising is that for higher quality levels the negative correlation is much weaker.


Reflection

This was a great learning exercise for me. In simple words, I learnt how to explore a huge dataset and draw conclusions about relationships between different variables in the dataset.

My main focus in this study was to explore the relationship of quality with other variables in the dataset. Quality has its strongest correlations with density, chlorides and alcohol levels, although none of them is strong in absolute terms. I was able to explore how quality changes with these variables and draw conclusions about their behaviour.

The univariate and bivariate sections of the analysis were straightforward, but I faced challenges in the multivariate section. When an analyst is evaluating multiple variables at once, there are countless possibilities for structuring the visualization and a multitude of variable combinations to investigate. I was able to overcome this difficulty by focusing mainly on the feature of interest and building upon the analysis I did in the bivariate section. Creating a predictive model for quality was also a huge challenge since quality did not have strong enough correlations with any of the other variables.

The most obvious next step in the analysis would be to compare this data with the red wine data and find similar and conflicting trends in determining quality. This will help us in drawing further conclusions. A predictive model for quality could also be built using machine learning. More sophisticated machine learning techniques will come in handy when dealing with a large number of variables that are only loosely correlated with quality.

Looking back, this was a wonderful exercise to practice my exploratory data analysis abilities while discovering new insights about the world of wines at the same time.